A New Probabilistic Model of Text Classi cation and

ثبت نشده
چکیده

This paper introduces the multinomial model of text classiication and retrieval. One important feature of the model is that the tf statistic, which usually appears in probabilistic IR models as a heuristic, is an integral part of the model. Another is that the variable length of documents is accounted for, without either making a uniform length assumption or using length normalization. The multinomial model employs independence assumptions which are similar to assumptions made in previous probabilistic models , particularly the binary independence model and the 2-Poisson model. The use of simulation to study the model is described. Performance of the model is evaluated on the TREC-3 routing task. Results are compared with the binary independence model and with the simulation studies.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Comparison of Event Models for Naive Bayes Text Classi cation

Recent approaches to text classi cation have used two di erent rst order probabilistic models for classi ca tion both of which make the naive Bayes assumption Some use a multi variate Bernoulli model that is a Bayesian Network with no dependencies between words and binary word features e g Larkey and Croft Koller and Sahami Others use a multinomial model that is a uni gram language model with i...

متن کامل

A New Probabilistic Model of Text Classi cation and Retrieval

This paper introduces the multinomial model of text classiication and retrieval. One important feature of the model is that the tf statistic, which usually appears in probabilistic IR models as a heuristic, is an integral part of the model. Another is that the variable length of documents is accounted for, without either making a uniform length assumption or using length normalization. The mult...

متن کامل

Title: Hierarchically Classifying Documents Using Very Few Words Authors: Hierarchically Classifying Documents Using Very Few Words

The proliferation of topic hierarchies for text documents has resulted in a need for tools that automatically classify new documents within such hierarchies. One can use existing classi ers by ignoring the hierarchical structure, treating the topics as separate classes. Unfortunately, in the context of text categorization, we are faced with a large number of classes and a huge number of relevan...

متن کامل

Combining weak learners with probabilistic models for binary classication

In this project it is addressed the problem of binary classi…cation i.e. we want to …nd a model F : X! f0; 1g for the dependence of a class c on x from a set of samples D = fct;xtgt=1 with ct 2 f0; 1g. The approach is to combine multiple base learners with the use of probabilistic models. In particular, I used the methodology presented by Ghersen…eld for doing regression in complex time series[...

متن کامل

Text Classification from Labeled and Unlabeled Documents Using

This paper shows that the accuracy of learned text classi ers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classi cation problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. We introduce an algorithm for learning from lab...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1996